The original dataset bullying_network.xlsx is an Excel file that contains two sheets about bullying within a group of 25 students in an Italian school class: the first one encloses three attributes per student; the second one represents the adjacency matrix of a directed network depicting who bullies whom among the students.
The first sheet, the attributes one, contains:
As previously described, the original Excel file contains two sheets. A first approach used the read_excel(path, sheet) function and saved the two sheets in separate variables. However, this attempt read the 0s and 1s of the adjacency matrix as characters instead of doubles, producing a misread network and an undirected graph. We therefore opted for a different import strategy: the original Excel file was split manually into two csv files:
- bullying_attributes.csv: containing the attributes of the students in the network;
- bullying_network.csv: containing the actual adjacency matrix (i.e. the network).

library(tidyverse)
library(sna)
library(knitr)
library(itsadug) # For gradient legend
library(formattable) # For nice tables
library(igraph)
library(visNetwork)

With the following code, the network is imported, converted into a matrix and then into a directed network.
# Importing data
net = as.matrix(read.csv("bullying_network.csv",
stringsAsFactors=FALSE, row.names=1))
# Converting the adjacency matrix to a network
net = as.network(net, directed=T)
# Inspect the network
summary(net, print.adj = FALSE)
## Network attributes:
## vertices = 25
## directed = TRUE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## bipartite = FALSE
## total edges = 76
## missing edges = 0
## non-missing edges = 76
## density = 0.1266667
##
## Vertex attributes:
## vertex.names:
## character valued attribute
## 25 valid vertex names
##
## No edge attributes
By inspecting the obtained network through summary(net), we may notice:
Students' attributes have been saved in a distinct file and are therefore imported separately.
# Reading the csv file with node attributes with given types for summary
attr = read.csv("bullying_attributes.csv",
colClasses = c("character","factor","numeric","factor"))
summary(attr)
## ID Gender Grade Ethnicity
## Length:25 0:12 Min. : 3.00 1: 5
## Class :character 1:13 1st Qu.: 5.00 2: 6
## Mode :character Median : 6.00 3:14
## Mean : 6.16
## 3rd Qu.: 8.00
## Max. :10.00
As the summary shows, there are slightly more males than females, but since the class has an odd number of students this one-student difference is not meaningful. Ethnicity shows a more marked split, with 11 immigrants and 14 Italians. The minimum grade is 3, although marks typically start from 0, while the maximum is 10. Both the median and the mean of the class are around 6, which corresponds to a pass ("sufficienza") in the Italian system.
# Re-importing as numerical for the following plots and measures
attr = read.csv("bullying_attributes.csv")
par(mar=c(0,0,0,0)) # Margin deletion
gplot(net,
gmode="digraph", # directed network
mode="fruchtermanreingold", # set mode as Fruchterman-Reingold
jitter=F, # do not allow nodes to be "jittered"
edge.col="grey70", # set color of ties
vertex.col="cornflowerblue", # set color of nodes
vertex.sides = 100,
displaylabels=T, # indicate that labels should be included
label.pos=1, # indicate that labels should be given below points
label.cex=.7, # indicate the size of the labels (1 is default)
arrowhead.cex = 0.4)

The Fruchterman-Reingold layout is the default mode of the gplot() function, so it is not strictly necessary to specify it. Overall, there are 25 labelled nodes connected through directed links representing who bullies whom. In particular, one student, Roberto, has neither incoming nor outgoing edges and, following the definition of Wasserman & Faust (1994), is an isolate1. Roberto is therefore the only student in this school class who is neither bullied nor bullies.
Once again, let’s consider the three attributes and their respective values:
These three attributes are all categorical and can be divided into nominal (gender and ethnicity) and ordinal (grade). This distinction matters for the visualisation: nominal values can be mapped to variation in colour hue or shape, while ordinal values call for a different encoding. Two versions are displayed below, using combinations of shape, colour and size. The first varies the node size (area) on a proportional scale, where larger sizes mean greater quantities (e.g., grade 10). The second exploits colour lightness, which can represent quantitative scales: the darker the colour, the greater the quantity2.
A first version considers the variations in:
colours_1 = c("#2a9d8f", "#e9c46a", "#e76f51")
sizes = seq(0.1, 4, length.out=11)
par(mar=c(0,0,0,0))
gplot(net,
gmode="digraph", # directed network
jitter=T,
edge.col="grey70", # set color of ties
vertex.col=colours_1[attr$Ethnicity], # set color of nodes depending on ethnicity
vertex.cex = sizes[attr$Grade], # set size of nodes depending on grade
vertex.sides= (attr$Gender-1)*47+4, # set shape of nodes depending on gender
vertex.rot = 45, # Nodes rotation
arrowhead.cex = 0.5) # Size of the arrows head

In this representation, both gender and ethnicity are evident, while size does not represent the grade very accurately. A size vector proportionally maps each node's size to its grade. Although this combination satisfies the assigned requirements, the reader might find it difficult to distinguish grade differences.
In the following chunk, a legend is added to guide the user in the reading of the network:
sizes = seq(0.2, 3, length.out=11)
par(mar=c(0,0,0,0)) # Margin deletion
gplot(net,
gmode="digraph", # directed network
jitter=T,
edge.col="grey70", # set color of ties
vertex.col=colours_1[attr$Ethnicity], # set color of nodes depending on ethnicity
vertex.cex = sizes[attr$Grade], # set size of nodes depending on grade
vertex.sides= (attr$Gender-1)*47+4, # set shape of nodes depending on gender
vertex.rot = 45,
arrowhead.cex = 0.5,
displaylabels=T, # indicate that labels should be included
label.pos=1, # indicate that labels should be given below points
label.cex=.7) # indicate the size of the labels (1 is default)
legend("topleft",
legend = c("Ethnicity"," 1st gen immigrants", " 2nd gen immigrants", " Italian",
" ", "Gender", " Male", " Female",
" ", "Grade", 0:10),
col = c("white",colours_1[1],colours_1[2],colours_1[3],
"white","white","black","black",
"white", "white", rep("black",11)),
bty = "n",
pch =c(15,15,15,15,
15,15,19,15,
15,15,rep(19,11)),
pt.cex = c(0,2.1,2.1, 2.1,
0,0,2.1,2.1,
0, 0, (0:10)/5),
cex = 1,
text.col = "grey10",
horiz = F ,
inset = c(0.01))

As stated above, the second approach represents the attributes in a different way:
colours_2 = c('#fff0f3', '#ffccd5', '#ffb3c1', '#ff8fa3', '#ff758f', # Reds
'#ff4d6d', '#c9184a', '#a4133c', '#800f2f', '#590d22', # Reds
"#d8f3dc","#b7e4c7","#95d5b2","#74c69d","#52b788", # Greens
"#40916c", "#2d6a4f","#245741","#1b4332","#081c15" # Greens
)
shapes = c(3,4,50)
par(mar=c(0,0,0,0)) # Margin deletion
gplot(net,
gmode="digraph", # directed network
jitter=F, # do not allow nodes to be "jittered"
edge.col="grey70", # set color of ties
vertex.col=colours_2[attr$Gender*10+attr$Grade+1], # set color of nodes
vertex.cex = 1.3,
vertex.rot = 45,
vertex.sides= shapes[attr$Ethnicity],
arrowhead.cex = 0.5)

Although ethnicity becomes less evident, the grade variable acquires a central role in this visualisation. Using colour lightness to encode the grades (i.e. higher grades with darker colours) emphasizes the differences more than the size-based encoding of the previous version.
Notice that the exercise also required the use of size. However, applying size variation to gender or ethnicity may unconsciously suggest a social superiority of one category over the other, which does not exist. This visualisation therefore avoids size to prevent any possible misunderstanding.
In this case, the legend shows:
This solution may help understand the grade distribution among students, but getting the exact grade is still challenging.
par(mar=c(0,0,0,0)) # Margin deletion
gplot(net,
gmode="digraph", # directed network
jitter=T, # do not allow nodes to be "jittered"
edge.col="grey70", # set color of ties
vertex.col=colours_2[attr$Gender*10+attr$Grade+1], # set color of nodes
vertex.cex = 1,
vertex.rot = 45,
vertex.sides= shapes[attr$Ethnicity],
displaylabels=T, # indicate that labels should be included
arrowhead.cex = 0.5,
label.pos=1, # indicate that labels should be given below points
label.cex=.7) # indicate the size of the labels (1 is default)
# Legend for gender and ethnicity
legend("topleft",
legend = c("Ethnicity"," 1st gen immigrants", " 2nd gen immigrants", " Italian",
" ", "Gender", " Male", " Female",
" ", "Grade"),
col = c("white","black","black", "black",
"white","white","#52b788","#ff4d6d",
"white", "white"),
bty = "n",
pch =c(15,17,15,19,
15,15,15,15,
15, 15),
pt.cex = c(0,2.1,2.1, 2.2,
0,0,2.1,2.1,
0,0),
cex = 1,
text.col = "grey10",
horiz = F ,
inset = c(0.01))
# Gradient legend
pal = colorRampPalette(colours_2[1:10])
gradientLegend(valRange=c(0,10),
color = pal(10),
pos=c(0.05,0.1,.08,0.5),
side=4,
n.seg=1,
border.col = "grey10",
inside = T)
pal = colorRampPalette(colours_2[11:20])
gradientLegend(valRange=c(0,10),
color = pal(10),
pos=c(0.12,0.1,.15,0.5), side=4,
n.seg=1,
border.col = "grey10",
inside = T)

This third version exploits the package visNetwork to build an interactive visualisation of the network, reproducing the first version proposed above, where:
Moreover, ethnicity can be chosen through the selection above the plot.
# Creation of the igraph network
i_net<-graph_from_adjacency_matrix(as.matrix(read.csv("bullying_network.csv",
row.names = 1,
stringsAsFactors = F)),
mode="directed", diag=F)
# Setting colours, shapes, sizes, ethnicity, gender, title
V(i_net)$color = colours_1[attr$Ethnicity]
V(i_net)$shape = c("dot","square")[attr$Gender+1]
V(i_net)$size = attr$Grade*3
V(i_net)$ethnicity = ifelse(attr$Ethnicity==3, "Italian",
ifelse(attr$Ethnicity == 1,"1st immigrant", "2nd immigrant"))
V(i_net)$gender = ifelse(attr$Gender == 0, "Female","Male")
V(i_net)$title = paste0("<p><b>", attr$ID,"</b><br>",
V(i_net)$gender,", ",attr$Grade,", ",
V(i_net)$ethnicity,"</p>")
# Dataframe for the legend
lnodes <- data.frame(label = c("1st immigrants","2nd immigrants","Italians","Females", "Males"),
shape = c(rep("triangle",3),"dot","square"),
color = c(colours_1,"#B78FB5","#B78FB5"),
font.size = 10,
title = "Information")
# Creation of the visNetwork object
new_net = toVisNetworkData(i_net)
# Actual visualisation
visNetwork(nodes = new_net$nodes,
edges = new_net$edges,
height = "800px", width = "100%",
font.size = 9,
physics = T) %>%
visEdges(arrows = "to") %>%
visOptions(highlightNearest = TRUE,
selectedBy = "ethnicity") %>%
visIgraphLayout(layout = "layout.fruchterman.reingold") %>%
visLegend(addNodes = lnodes, useGroups = FALSE, ncol = 1)

The interactive version allows better exploration and comprehension of the network: all the information related to a node can be visualised by simply clicking on it. By introducing this further layer, the problem of reading and comparing grades is solved.
Degree centrality is defined as the number of links incident upon a node. Having a directed network implies splitting the degree into in-degree and out-degree. The former counts the number of incoming edges to a node, while the latter refers to the number of edges a node directs to others3.
When edges express positive aspects (e.g., friendship, collaboration), in-degree can be interpreted as a form of popularity and out-degree as gregariousness. However, a reversed scenario is presented here due to the edges’ negative connotation, i.e. bullying.
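As a toy illustration of these definitions (synthetic data, not the report's network): in a directed adjacency matrix A, with A[i, j] = 1 meaning "i bullies j", the out-degree of node i is its row sum and the in-degree its column sum.

```r
# Toy 3-node directed network: A[i, j] = 1 if i bullies j
A <- matrix(c(0, 1, 1,
              0, 0, 1,
              0, 0, 0), nrow = 3, byrow = TRUE)
rowSums(A)   # out-degrees: 2 1 0
colSums(A)   # in-degrees:  0 1 2
```

The sna::degree() calls below perform the same computation on the full 25-node network.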
Let’s first consider the in-degree centrality of nodes with the function degree(network, gmode, cmode="indegree"), which will return the in-degree for each student. The same function can be used to compute the out-degree, setting cmode="outdegree".
# How many people are you bullied from?
indegree = sna::degree(net, gmode="digraph", cmode = "indegree")
# How many people do you bully?
outdegree = sna::degree(net, gmode="digraph", cmode = "outdegree")
# Overall Degree
degree = sna::degree(net, gmode="digraph")
attr$indegree = indegree
attr$outdegree = outdegreeTo properly present the degree centrality measures, a dataframe has been created with students names, in-degree, out-degree and total degree (i.e. the sum of both in and out-degree). In addition, the normalized in-degree is computed by dividing the in-degree over the number of nodes minus \(1\) (i.e. \(n-1\)). The last column indicates whether the single node in-degree is below (or above) the average.
| Name | In degree | Out degree | Degree | Normalized In degree | Above Average (In degree) |
|---|---|---|---|---|---|
| Sara | 9 | 2 | 11 | 0.38 | |
| Gianna | 8 | 1 | 9 | 0.33 | |
| Marco | 8 | 1 | 9 | 0.33 | |
| Andrea | 7 | 2 | 9 | 0.29 | |
| Mario | 7 | 2 | 9 | 0.29 | |
| Valentina | 6 | 0 | 6 | 0.25 | |
| Giulia | 4 | 4 | 8 | 0.17 | |
| Laura | 3 | 0 | 3 | 0.12 | |
| Sofia | 3 | 2 | 5 | 0.12 | |
| Angela | 2 | 7 | 9 | 0.08 | |
| Angelo | 2 | 6 | 8 | 0.08 | |
| Francesco | 2 | 7 | 9 | 0.08 | |
| Giovanni | 2 | 3 | 5 | 0.08 | |
| Giuseppe | 2 | 2 | 4 | 0.08 | |
| Maria | 2 | 1 | 3 | 0.08 | |
| Rosa | 2 | 5 | 7 | 0.08 | |
| Stella | 2 | 6 | 8 | 0.08 | |
| Anna | 1 | 6 | 7 | 0.04 | |
| Aurora | 1 | 7 | 8 | 0.04 | |
| Flavio | 1 | 1 | 2 | 0.04 | |
| Luigi | 1 | 4 | 5 | 0.04 | |
| Stefano | 1 | 1 | 2 | 0.04 | |
| Alessandro | 0 | 5 | 5 | 0.00 | |
| Luca | 0 | 1 | 1 | 0.00 | |
| Roberto | 0 | 0 | 0 | 0.00 |
Considering in-degree centrality, Sara is the most central node, followed by Gianna and Marco. As stated above, whereas many incoming edges represent an advantageous condition in several contexts, in this network they indicate an unfavourable status: the in-degree measures the number of bullies a node must face. Sara is thus the main victim, bullied by \(9\) people, corresponding to \(37.5\%\) of the other students (normalization obtained by \(\frac{\text{in-degree}}{n-1} = \frac{9}{24} = 0.375\), with \(n\) the number of nodes).
In the worst-case scenario, Sara would be bullied by all \(24\) other students. Although she is far from that scenario, her in-degree is exceptionally high compared to the average and median in-degree. The same holds for students like Gianna, Marco, Andrea, Mario, Valentina and Giulia.
It is also worth noticing that Sara’s condition seems not to be determined by her ethnicity or gender, but by her grade. While at least one member per ethnicity and/or per gender has been bullied, the reason for which Sara is bullied by several students might be the achievement of the highest mark: \(10\).
| Measures | Value |
|---|---|
| Most central node | Sara |
| Max in degree | 9 |
| (Theoretical) max in degree | 24 |
| Least central nodes | Alessandro, Luca, Roberto |
| Min in degree | 0 |
| (Theoretical) min in degree | 0 |
| Average in degree | 3.04 |
| Median in degree | 2 |
As the bar plot below shows, the average is strongly affected by the unbalanced situation within the network: a small group of highly bullied victims stands against the remaining nodes with low in-degree values. The extreme cases (\(\text{in-degree} = 0\)) are further analysed in the following sections and might correspond either to isolates (i.e. Roberto) or to bullies.
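The described bar plot can be sketched directly from the in-degree values reported in the table above (sorted in decreasing order), with the average as a reference line.

```r
# In-degree values of the 25 students, from the table above
indeg <- c(9, 8, 8, 7, 7, 6, 4, 3, 3, rep(2, 8), rep(1, 5), rep(0, 3))
barplot(indeg, col = "cornflowerblue", ylab = "In-degree",
        main = "In-degree distribution")
abline(h = mean(indeg), lty = 2)   # average in-degree (3.04)
```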
Note that in the following network the node size indicates whether the in-degree is above average (bigger) or not (smaller), exploiting the interactive version proposed in section Third interactive version.
There is no unique central node for the out-degree measure: Angela, Aurora and Francesco are the \(3\) most central nodes (as shown in the table below). It is also worth noticing that their total degree is \(9\) or \(8\) and, therefore, they are not the extreme cases hypothesized above (i.e., \(\text{in-degree}=0\)). As in the in-degree case, some reference points are computed and reported below.
| Name | Out degree | In degree | Degree | Normalized Out degree | Above Average (Out degree) |
|---|---|---|---|---|---|
| Angela | 7 | 2 | 9 | 0.292 | |
| Aurora | 7 | 1 | 8 | 0.292 | |
| Francesco | 7 | 2 | 9 | 0.292 | |
| Angelo | 6 | 2 | 8 | 0.250 | |
| Anna | 6 | 1 | 7 | 0.250 | |
| Stella | 6 | 2 | 8 | 0.250 | |
| Alessandro | 5 | 0 | 5 | 0.208 | |
| Rosa | 5 | 2 | 7 | 0.208 | |
| Giulia | 4 | 4 | 8 | 0.167 | |
| Luigi | 4 | 1 | 5 | 0.167 | |
| Giovanni | 3 | 2 | 5 | 0.125 | |
| Andrea | 2 | 7 | 9 | 0.083 | |
| Giuseppe | 2 | 2 | 4 | 0.083 | |
| Mario | 2 | 7 | 9 | 0.083 | |
| Sara | 2 | 9 | 11 | 0.083 | |
| Sofia | 2 | 3 | 5 | 0.083 | |
| Flavio | 1 | 1 | 2 | 0.042 | |
| Gianna | 1 | 8 | 9 | 0.042 | |
| Luca | 1 | 0 | 1 | 0.042 | |
| Marco | 1 | 8 | 9 | 0.042 | |
| Maria | 1 | 2 | 3 | 0.042 | |
| Stefano | 1 | 1 | 2 | 0.042 | |
| Laura | 0 | 3 | 3 | 0.000 | |
| Roberto | 0 | 0 | 0 | 0.000 | |
| Valentina | 0 | 6 | 6 | 0.000 |
| Measures | Value |
|---|---|
| Most central node | Angela, Aurora, Francesco |
| Max out degree | 7 |
| (Theoretical) max out degree | 24 |
| Least central nodes | Laura, Roberto, Valentina |
| Min out degree | 0 |
| (Theoretical) min out degree | 0 |
| Average out degree | 3.04 |
| Median out degree | 2 |
Each of the three central out-degree nodes bullies almost one-third of the overall network:
\[ \frac{\text{out-degree}}{n-1} = \frac{7}{24} \approx 0.29 \approx \frac{1}{3} \]
These values are well above the average of 3.04 \(\pm\) 2.423 but far from the extreme case of a completely centralized network (i.e. maximum out-degree). A bar plot of the out-degree distribution may help understand the reasons behind these low reference points.
Once again, the out-degree measure shows an imbalanced situation, with a significant gap between the lowest values (i.e. \(0\)) and the highest ones (i.e. \(7\)). Overall, \(15\) nodes have an out-degree below the average, while the remaining \(10\) nodes score slightly above it, up to more than double the average for the most central nodes.
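The out-degree bar plot can likewise be sketched from the values in the table above (sorted in decreasing order), again with the average as a reference line.

```r
# Out-degree values of the 25 students, from the table above
outdeg <- c(7, 7, 7, 6, 6, 6, 5, 5, 4, 4, 3, rep(2, 5), rep(1, 6), rep(0, 3))
barplot(outdeg, col = "cornflowerblue", ylab = "Out-degree",
        main = "Out-degree distribution")
abline(h = mean(outdeg), lty = 2)   # average out-degree (3.04)
```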
Note that in the following network the node size indicates whether the out-degree is above average (bigger) or not (smaller), exploiting the interactive version proposed in section Third interactive version.
grade = attr$Grade
cor_value = cor(indegree, grade, method = c("pearson"))
cor_value
## [1] 0.852993
The Pearson correlation coefficient between in-degree and grade is approximately 0.853. A strong, positive linear relationship can therefore be detected between being bullied (i.e., in-degree) and school results: higher marks are associated with a higher number of bullies.
cor.test(indegree,grade)
##
## Pearson's product-moment correlation
##
## data: indegree and grade
## t = 7.838, df = 23, p-value = 6.082e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6906372 0.9334951
## sample estimates:
## cor
## 0.852993
Moreover, the correlation test shows a low p-value, indicating a statistically significant relationship between the number of bullies a student faces and the grade they get, as illustrated by the following scatter plot. Each data point is described in terms of grade (y-axis) and in-degree (x-axis), and the fitted linear relationship is reported along with its \(95\%\) confidence interval.
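A minimal base-R sketch of such a scatter plot with a fitted line and \(95\%\) confidence band; synthetic data are used here, whereas the report would plot the students' in-degree against their grade.

```r
# Synthetic stand-in data: x plays the role of in-degree, y of grade
set.seed(1)
x <- sample(0:9, 25, replace = TRUE)
y <- 3 + 0.7 * x + rnorm(25, sd = 1)
fit <- lm(y ~ x)                               # linear model
plot(x, y, pch = 19, xlab = "In-degree", ylab = "Grade")
abline(fit, col = "#7C5079", lwd = 2)          # fitted line
newx <- data.frame(x = seq(min(x), max(x), length.out = 50))
ci <- predict(fit, newx, interval = "confidence", level = 0.95)
lines(newx$x, ci[, "lwr"], lty = 2)            # lower 95% band
lines(newx$x, ci[, "upr"], lty = 2)            # upper 95% band
```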
However, despite the high correlation coefficient, it is worth noticing some fallacies in the model:
To confirm the relationship between grade and being bullied, an ANOVA test is applied; its low p-value indicates that grade is a discriminant factor for the studied network. In particular, the pairwise (Tukey) test identifies the pairs of grades showing a substantial difference in terms of being bullied: for instance, grades \(8\) and \(3\) differ significantly, while \(4\) and \(3\) do not.
res.aov <- aov(indegree ~ as.factor(Grade), data = attr)
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Grade) 7 157.44 22.491 14.98 3.87e-06 ***
## Residuals 17 25.52 1.501
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(res.aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = indegree ~ as.factor(Grade), data = attr)
##
## $`as.factor(Grade)`
## diff lwr upr p adj
## 4-3 -0.5000000 -4.7091325 3.709132 0.9998699
## 5-3 1.3333333 -2.1034089 4.770076 0.8739953
## 6-3 1.6428571 -1.7319568 5.017671 0.7035655
## 7-3 0.5000000 -4.6551134 5.655113 0.9999667
## 8-3 6.1666667 2.3242720 10.009061 0.0007910
## 9-3 5.8333333 1.9909387 9.675728 0.0014273
## 10-3 8.5000000 3.3448866 13.655113 0.0005882
## 5-4 1.8333333 -1.6034089 5.270076 0.6088329
## 6-4 2.1428571 -1.2319568 5.517671 0.4078165
## 7-4 1.0000000 -4.1551134 6.155113 0.9969257
## 8-4 6.6666667 2.8242720 10.509061 0.0003312
## 9-4 6.3333333 2.4909387 10.175728 0.0005905
## 10-4 9.0000000 3.8448866 14.155113 0.0003084
## 6-5 0.3095238 -2.0322213 2.651269 0.9997359
## 7-5 -0.8333333 -5.3797160 3.713049 0.9978363
## 8-5 4.8333333 1.8570272 7.809639 0.0006956
## 9-5 4.5000000 1.5236939 7.476306 0.0014893
## 10-5 7.1666667 2.6202840 11.713049 0.0009599
## 7-6 -1.1428571 -5.6426090 3.356895 0.9848795
## 8-6 4.5238095 1.6192322 7.428387 0.0010913
## 9-6 4.1904762 1.2858988 7.095054 0.0023970
## 10-6 6.8571429 2.3573910 11.356895 0.0013723
## 8-7 5.6666667 0.8063791 10.526954 0.0162923
## 9-7 5.3333333 0.4730458 10.193621 0.0260442
## 10-7 8.0000000 2.0473878 13.952612 0.0047484
## 9-8 -0.3333333 -3.7700756 3.103409 0.9999667
## 10-8 2.3333333 -2.5269542 7.193621 0.7167896
## 10-9 2.6666667 -2.1936209 7.526954 0.5775206
Instead, focusing on grade and bullying others (i.e. out-degree), the p-value is high, leading to the conclusion that the relationship is not statistically significant.
res.aov <- aov(outdegree ~ as.factor(Grade), data = attr)
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Grade) 7 60.58 8.654 1.83 0.146
## Residuals 17 80.38 4.728
This section hosts t-tests which try to identify differences in the means of two groups: males and females. The tests were run on both in-degree and out-degree. No significant p-value was obtained, meaning that gender is not discriminant in the bullying behaviour inside the class.
# Mean difference between males and females (in degree)
t.test(filter(attr,Gender==0)$indegree,
filter(attr,Gender==1)$indegree)
##
## Welch Two Sample t-test
##
## data: filter(attr, Gender == 0)$indegree and filter(attr, Gender == 1)$indegree
## t = 0.94555, df = 22.988, p-value = 0.3542
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.241156 3.330900
## sample estimates:
## mean of x mean of y
## 3.583333 2.538462
# Mean difference between males and females (out degree)
t.test(filter(attr,Gender==0)$outdegree,
filter(attr,Gender==1)$outdegree)
##
## Welch Two Sample t-test
##
## data: filter(attr, Gender == 0)$outdegree and filter(attr, Gender == 1)$outdegree
## t = 0.73287, df = 21.115, p-value = 0.4717
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.330422 2.779140
## sample estimates:
## mean of x mean of y
## 3.416667 2.692308
For completeness, some tests are run for ethnicity too. Since there are three ethnic groups among students, ANOVA tests investigate pairwise differences. As TukeyHSD() shows, the p-values are not significant and therefore ethnicity is not discriminant for bullying or being bullied.
# Being bullied and ethnicity
res.aov <- aov(indegree ~ as.factor(Ethnicity), data = attr)
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Ethnicity) 2 0.45 0.223 0.027 0.974
## Residuals 22 182.51 8.296
TukeyHSD(res.aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = indegree ~ as.factor(Ethnicity), data = attr)
##
## $`as.factor(Ethnicity)`
## diff lwr upr p adj
## 2-1 0.2000000 -4.181304 4.581304 0.9927790
## 3-1 0.3428571 -3.426744 4.112459 0.9716690
## 3-2 0.1428571 -3.387698 3.673412 0.9943215
# Bullying and ethnicity
res.aov <- aov(outdegree ~ as.factor(Ethnicity), data = attr)
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Ethnicity) 2 2.43 1.213 0.193 0.826
## Residuals 22 138.53 6.297
TukeyHSD(res.aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = outdegree ~ as.factor(Ethnicity), data = attr)
##
## $`as.factor(Ethnicity)`
## diff lwr upr p adj
## 2-1 -0.9333333 -4.750420 2.883754 0.8139202
## 3-1 -0.6000000 -3.884159 2.684159 0.8909550
## 3-2 0.3333333 -2.742563 3.409230 0.9600440
A permutation-based test (also called re-randomization test) is a statistical significance test in which the distribution of the test under the null hypothesis is obtained by calculating all possible values of the test statistic under all possible rearrangements of the observed data points4.
This test is performed on \(1000\) permutations of the grade values, computing at each step the Pearson correlation coefficient between the students' in-degree and the permuted grades.
set.seed(21)
permutation = matrix(NA,1000,1)
for (k in c(1:1000))
{
grade_perm = sample(grade)
permutation[k,1] = cor(indegree, grade_perm)
}
# Histogram
hist(permutation, prob=TRUE,
xlim=c(-1.0,1.0), ylim = c(0.0, 2.0),
xlab = "Value of permutation", col = "#C7A8C5")
# Gaussian curve
curve(dnorm(x, mean=mean(permutation), sd=sd(permutation)),
col="#7C5079", lwd=2, add=TRUE)
text(x = 0.56, y = cor(indegree, grade),
expression(paste("N~(", mu, "= 0.0, ", sigma, "= 0.2)")), col = "#3E283D")
# P-value line
abline(v=cor(indegree, grade), lwd=2, col="#7C5079")
text(x = cor(indegree, grade)+0.1, y = cor(indegree, grade),
expression(paste(rho, ": 0.85")), col = "#7C5079")

As the histogram above shows, the permuted correlations follow a Gaussian distribution within the range of the Pearson coefficient (i.e. \([-1,1]\)), with mean around \(0\): random permutations of grade are far more likely to show no correlation with in-degree (i.e. \(0\)) than a perfect one (i.e. \(-1\) or \(1\)). Moreover, as indicated by the vertical line at \(x = 0.85\), the Pearson coefficient of the original sample is far from the average correlation obtained by permuting the data.
pnorm(cor_value, mean = mean(permutation), sd = sd(permutation), lower.tail = T)
## [1] 0.9999727
The function pnorm() shows that the correlation coefficient of the original sample is above \(99\%\) of the coefficients computed by randomly permuting the data. Hence, obtaining a Pearson coefficient of \(0.853\) by chance would occur in less than \(1\%\) of the cases, and the correlation is therefore statistically significant.
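An alternative to pnorm(), which assumes the permuted coefficients are Gaussian, is the empirical p-value: the share of permuted correlations at least as large as the observed one. A minimal sketch on synthetic data (the report would use indegree and grade):

```r
# Synthetic correlated pair standing in for (indegree, grade)
set.seed(21)
x <- rnorm(25)
y <- x + rnorm(25, sd = 0.5)
obs <- cor(x, y)                               # observed correlation
perm <- replicate(1000, cor(x, sample(y)))     # permuted correlations
p_emp <- mean(perm >= obs)                     # empirical (one-sided) p-value
p_emp
```

This makes no distributional assumption on the permutation null, at the cost of a resolution limited to 1/1000.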
Centralization describes how central the most central nodes are compared to the others, and it is a characteristic of the network itself: it relates to the distribution of centrality across the network5, whereas centrality refers to the individual nodes6.
In an undirected network, centralization is computed as follows:
\[ Centralization = \frac{\sum_i (d_{max}-d_i)}{(n-1)(n-2)} \]

The most centralized network has a single node at its centre, linked to every other node in the network. In this case, the maximum degree is \((n-1)\) and the centralization index returns \(1\):
\[ C_{max} = \frac{(24-1) \cdot 24+(24-24)\cdot1}{24\cdot23} = \frac{23\cdot24}{23\cdot24} = 1 \]
In the opposite case, namely in a decentralized network, every node is linked to everybody else and the centralization index is \(0\):
\[ C_{min} = \frac{(24-24)\cdot25}{24\cdot23} = \frac{0}{23\cdot24} = 0 \]
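As a sanity check, a minimal base-R sketch verifies that the formula attains its maximum on a star network, without relying on sna::centralization():

```r
# 25-node undirected star: node 1 tied to every other node
n <- 25
A <- matrix(0, n, n)
A[1, 2:n] <- 1
A[2:n, 1] <- 1
deg <- rowSums(A)                           # degrees: 24, 1, 1, ..., 1
sum(max(deg) - deg) / ((n - 1) * (n - 2))   # centralization index = 1
```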
# Total degree centralization
sna::centralization(net,sna::degree)
## [1] 0.111413
In the bullying network, the centralization index is 0.111, which indicates a reasonably decentralized network.
Since the bullying network is directed, it is possible to split the centralization index into in-degree and out-degree. As the table below shows, the in-degree centralization index is higher than the out-degree one, indicating a more centralized bullied network (i.e. few kids are bullied from many) and a less centralized bullying network (i.e. nearly everybody bullies others).
Notice that both indexes are higher than the degree centralization index (i.e. 0.111), suggesting the presence of two more centralized sub-networks.
table = formattable(data.frame(
"Centralization index" = round(sna::centralization(net,sna::degree),3),
# In degree centralization
"In degree Centralization Index" = round(sna::centralization(net,sna::degree, cmode="indegree"),3),
# Out degree centralization
"Out degree Centralization Index" = round(sna::centralization(net,sna::degree, cmode="outdegree"),3)))
names(table) = c("Centralization Index", "In degree Centralization Index", "Out degree Centralization Index")
table

| Centralization Index | In degree Centralization Index | Out degree Centralization Index |
|---|---|---|
| 0.111 | 0.259 | 0.172 |
Moreover, considering the two hypothetical extremes, meaning the maximum (i.e. \(1\)) and minimum (i.e. \(0\)) centralization indexes, it can be noticed that the computed ones are closer to the minimal reference point than to the maximal. It is worth noticing that the sub-networks (i.e. bullying and being bullied) are decentralized due to multiple central nodes.
Before computing the reciprocity index, a dyad census is performed to understand the current state of the network, i.e. how many mutual, asymmetric and null dyads are present.
sna::dyad.census(net)
## Mut Asym Null
## [1,] 4 68 228
Given \(25\) nodes, the total number of dyads is \((25 \cdot 24)/2 = 300\). In this specific network, there are:
# Mutual edges
E(i_net)[which_mutual(i_net)]
## + 8/76 edges from 16fec93 (vertex names):
## [1] Andrea ->Francesco Angelo ->Giulia Flavio ->Giulia
## [4] Francesco->Andrea Giulia ->Angelo Giulia ->Flavio
## [7] Marco ->Sara Sara ->Marco
By looking at the table below with the nodes involved in the mutual edges, we can notice that:
| Name | In degree | Out degree | Degree | Normalized In degree | Normalized Out degree |
|---|---|---|---|---|---|
| Andrea | 7 | 2 | 9 | 0.29 | 0.083 |
| Angelo | 2 | 6 | 8 | 0.08 | 0.250 |
| Flavio | 1 | 1 | 2 | 0.04 | 0.042 |
| Francesco | 2 | 7 | 9 | 0.08 | 0.292 |
| Giulia | 4 | 4 | 8 | 0.17 | 0.167 |
| Marco | 8 | 1 | 9 | 0.33 | 0.042 |
| Sara | 9 | 2 | 11 | 0.38 | 0.083 |
As a further step, the reciprocity index is computed as the number of reciprocated edges (i.e. \(2M\), twice the number of mutual dyads \(M\)) over the total number of edges, namely \(2M + A\), with \(A\) the number of asymmetric dyads:
\[ Reciprocity = \frac{2\cdot M}{2\cdot M+A} \]
sna::grecip(net, measure="edgewise")
## Mut
## 0.1052632
The computation above yields a reciprocity rate of 0.105, meaning that about \(10.5\%\) of the edges in the network belong to mutual relations between nodes (i.e., a counterattack):

\[ Reciprocity = \frac{2\cdot 4}{2\cdot 4+68} = \frac{8}{76} = 0.105 \approx 10.5\% \]
Therefore, the arc-based reciprocity (also called edgewise or tie reciprocity) suggests a lack of mutual edges compared to the asymmetric ones, but not their absence. This statement should also be contextualized considering some reference points, such as the minimum and maximum possible reciprocity values. The former refers to the absence of reciprocity, namely a network with only asymmetric (or null) edges, and assesses a reciprocity rate of \(0\):
\[ R_{min} = \frac{2\cdot M}{2\cdot M+A} = \frac{2\cdot 0}{2\cdot 0+A}=0 \]
The latter instead assumes that all edges within the network are mutual. Therefore, the highest reachable value would be \(1\):
\[ R_{max} = \frac{2\cdot M}{2 \cdot M+A} = \frac{2\cdot M}{2\cdot M}=1 \]

Considering instead the dyadic measure of reciprocity, the proportion of mutual and null dyads over the total (namely the sum of mutual \(M\), asymmetric \(A\) and null \(N\) dyads) yields an index of 0.773.
\[ R_{\text{dyadic}} = \frac{M+N}{M+N+A} \]
sna::grecip(net, measure="dyadic")
## Mut
## 0.7733333
With this different interpretation of the dyads, the obtained reciprocity value is considerably higher than before, since a null dyad counts as a symmetric non-relation, i.e. an absence of bullying acts from both sides. The census shows that null dyads are by far the most common type (\(228\) out of \(300\)), which explains the high reciprocity rate.
A clear demonstration of the null dyads' strong influence is obtained by computing the ratio of mutual to non-null dyads.
\[ R_{\text{dyadic non null}} = \frac{M}{M+A} \]
The index drops to 0.056, indicating that only about \(5.6\%\) of the non-null dyads are mutual.
sna::grecip(net, measure="dyadic.nonnull")
## Mut
## 0.05555556
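For reference, the three reciprocity variants can be recomputed by hand from the dyad census (\(M = 4\) mutual, \(A = 68\) asymmetric, \(N = 228\) null dyads):

```r
# Dyad census counts reported above
M <- 4; A <- 68; N <- 228
edgewise <- 2 * M / (2 * M + A)    # share of reciprocated edges
dyadic   <- (M + N) / (M + N + A)  # mutual + null dyads over all dyads
nonnull  <- M / (M + A)            # mutual dyads over non-null dyads
round(c(edgewise = edgewise, dyadic = dyadic, nonnull = nonnull), 3)
```

The three values (0.105, 0.773, 0.056) match the grecip() outputs for the corresponding measure arguments.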
Wasserman, S., & Faust, K. (1994). Social network analysis: Methods and applications↩︎
Kirk, A. (2012). Data visualisation: a successful design process. Packt publishing LTD.↩︎
Freeman, L. C. (1978), Centrality in social networks conceptual clarification. Social networks, 1(3), 215-239.↩︎
Pitman, E. J. G. (1937). Significance Tests Which May be Applied to Samples From any Populations. Supplement to the Journal of the Royal Statistical Society, 4(1), 119–130.↩︎
Freeman, L. C. (1978). Centrality in social networks conceptual clarification. Social networks, 1(3), 215-239.↩︎
Everett, M. G., & Borgatti, S. P. (1999). The centrality of groups and classes. The Journal of mathematical sociology, 23(3), 181-201.↩︎